The bioinformatics analysis pipeline nfcore/ampliseq is used for amplicon sequencing, supporting denoising of any amplicon and supports a variety of taxonomic databases for taxonomic assignment including 16S, ITS, CO1 and 18S.
Pipeline input was saved in folder input.
Sequencing data was provided in the samplesheet file
samplesheet.csv that is displayed below:
Metadata associated with the sequencing data was provided in
metadata.tsv and is displayed below:
FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. The sequence quality was checked using FastQC and resulting data was aggregated using the FastQC module of MultiQC. For more quality controls and per sample quality checks you can check the full MultiQC report, which can be found in multiqc/multiqc_report.html.
Cutadapt is trimming primer sequences from sequencing reads. Primer sequences are non-biological sequences that often introduce point mutations that do not reflect sample sequences. This is especially true for degenerated PCR primer. If primer trimming were to be omitted, artifactual amplicon sequence variants might be computed by the denoising tool or sequences might be lost due to being labelled as PCR chimera. Primers were trimmed using cutadapt and all untrimmed sequences were discarded. Sequences that did not contain primer sequences were considered artifacts. Less than 2.2% of the sequences were discarded per sample and a mean of 98.2% of the sequences per sample passed the filtering. Cutadapt results can be found in folder cutadapt.
Additional quality filtering can improve sequence recovery. Often it is advised trimming the last few nucleotides to avoid less well-controlled errors that can arise there. Forward reads were trimmed at 220 bp and reverse reads were trimmed at 215 bp, reads shorter than this were discarded. Reads with more than 2 expected errors were discarded. Read counts passing the filter are shown in section ‘Read counts per sample’ column ‘filtered’.
Quality profiles:
Forward (left) and reverse (right) read quality stats for incoming data:
Forward (left) and reverse (right) read quality stats for preprocessed data:
Overall read quality profiles are displayed as heat map of the
frequency of each quality score at each base position. The mean quality
score at each position is shown by the green line, and the quartiles of
the quality score distribution by the orange lines. The red line shows
the scaled proportion of reads that extend to at least that position.
Original plots can be found in folder dada2/QC/ with names that end in
_qual_stats.pdf.
DADA2 performs fast and accurate sample inference from amplicon data with single-nucleotide resolution. It infers exact amplicon sequence variants (ASVs) from amplicon data with fewer false positives than many other methods while maintaining high sensitivity.
DADA2 reduces sequence errors and dereplicates sequences by quality filtering, denoising, read pair merging (for paired end Illumina reads only) and PCR chimera removal.
Read error correction was performed using estimated error rates, visualized below. Error rates for forward reads are at the left side and reverse reads are at the right side.
Estimated error rates are displayed for each possible transition. The
black line shows the estimated error rates after convergence of the
machine-learning algorithm. The red line shows the error rates expected
under the nominal definition of the Q-score. The estimated error rates
(black line) should be a good fit to the observed rates (points), and
the error rates should drop with increased quality. Original plots can
be found in folder dada2/QC/ with names that
end in .err.pdf.
Tracking read numbers through DADA2 processing steps for each sample. The following table shows the read numbers after each processing stage. Processing stages are: input - read pairs into DADA2, filtered - read pairs passed quality filtering, denoisedF - forward reads after denoising, denoisedR - reverse reads after denoising, merged - successfully merged read pairs, nonchim - read pairs in non-chimeric sequences (final ASVs).
Samples with unusual low reads numbers relative to the number of expected ASVs should be treated cautiously, because the abundance estimate will be very granular and might vary strongly between (theoretical) replicates due to high impact of stochasticity.
Following, the numbers of the table above are shown in stacked barcharts as percentage of DADA2 input reads. Stacked barchart of read pair numbers (denoisedF & denoisedR halfed, because each pair is split) per sample and processing stage:
Between 32.8% and 79.86% reads per sample (average 56.3%) were retained for analysis within DADA2 steps.
The proportion of lost reads per processing stage and sample should not be too high, totalling typically <50%. Samples that are very different in lost reads (per stage) to the majority of samples must be compared with caution, because an unusual problem (e.g. during nucleotide extraction, library preparation, or sequencing) could have occurred that might add bias to the analysis.
Finally, 1154 amplicon sequence variants (ASVs) were obtained across
all samples. The ASVs can be found in dada2/ASV_seqs.fasta. And the
corresponding quantification of the ASVs across samples is in dada2/ASV_table.tsv. An extensive
table containing both was saved as dada2/DADA2_table.tsv. ASVs were
initally inferred for each sample independently, but re-examined with
all samples (pseudo-pooled).
Barrnap classifies the ASVs into the origin domain (including mitochondrial origin).
Barrnap classified 0 ( 0 %) ASVs as most similar to Bacteria, 0 ( 0
%) ASVs to Archea, 0 ( 0 %) ASVs to Mitochondria, 0 ( 0 %) ASVs to
Eukaryotes, and 1154 ( 100 %) were below similarity threshold to any
kingdom.
rRNA classification results can be found in folder barrnap.
A length filter was used to reduce potential contamination. Before filtering, ASVs had the following length profile (count of 1 was transformed to 1.5 to allow plotting on log10 scale):
Filtering omitted all ASVs with length lower than 300 or above 340 bp.
The following table shows read counts for each sample before and after filtering:
In average 1.24875 % reads were removed, but at most 2.01 % reads per sample.
The number of ASVs was reduced by 61 ( 5.29 %), from 1154 to 1093 ASVs.
Length filter results can be found in folder asv_length_filter.
The taxonomic classification was performed by DADA2 using a custom database provided by the user.
DADA2 classified 70.54 % ASVs at Kingdom level, 31.93 % ASVs at Phylum level, 8.14 % ASVs at Class level, 0.91 % ASVs at Order level, 0.37 % ASVs at Family level, 0.37 % ASVs at Genus level, 0.37 % ASVs at Species level, 0 % ASVs at Species_exact level.
DADA2 taxonomy assignments can be found in folder dada2 in files ASV_tax_*.tsv.
Files that were input to QIIME2 can be found in folder qiime2/input/. Results of taxonomic classification of DADA2 was used in all following analysis, see in the above sections.
Unwanted taxa are often off-targets generated in PCR with primers
that are not perfectly specific for the target DNA. For 16S rRNA
sequencing mitrochondria and chloroplast sequences are typically removed
because these are frequent unwanted non-bacteria PCR products. ASVs were
removed when the taxonomic string contained any of
mitochondria,chloroplast (comma separated). Consequently,
1093 ASVs were reduced by 0 ( 0 %) to 1093 . The following table shows
read counts for each sample before and after filtering:
Tables with read count numbers and filtered abundance tables are in folder qiime2/abundance_tables.
The abundance tables are the final data for further downstream analysis and visualisations. The tables are based on the computed ASVs and taxonomic classification, but after removal of unwanted taxa. Folder qiime2/abundance_tables contains tap-separated files (.tsv) that can be opened by any spreadsheet software.
Absolute abundance tables produced by the previous steps contain count data, but the compositional nature of 16S rRNA amplicon sequencing requires sequencing depth normalisation. This step computes relative abundance tables using TSS (Total Sum Scaling normalisation) for various taxonomic levels and detailed tables for all ASVs with taxonomic classification, sequence and relative abundance for each sample. Typically used for in depth investigation of taxa abundances. Folder qiime2/rel_abundance_tables contains tap-separated files (.tsv) that can be opened by any spreadsheet software.
Interactive abundance plot that aids exploratory browsing the discovered taxa and their abundance in samples and allows sorting for associated meta data. Folder qiime2/barplot contains barplots, click qiime2/barplot/index.html to open it in your web browser.
Produces rarefaction plots for several alpha diversity indices, and is primarily used to determine if the richness of the samples has been fully observed or sequenced. If the slope of the curves does not level out and the lines do not become horizontal, this might be because the sequencing depth was too low to observe all diversity or that sequencing error artificially increases sequence diversity and causes false discoveries.
Folder qiime2/alpha-rarefaction contains the data, click qiime2/alpha-rarefaction/index.html to open it in your web browser.
Diversity measures summarize important sample features (alpha diversity) or differences between samples (beta diversity). Diversity calculations are based on sub-sampled data rarefied to 23228 counts.
Alpha diversity measures the species diversity within samples.
This step calculates alpha diversity using various methods and performs pairwise comparisons of groups of samples. It is based on a phylogenetic tree of all ASV sequences. Folder qiime2/diversity/alpha_diversity contains the alpha-diversity data:
Alpha diversity is considered not trustworthy when it correlates positively with sequencing depth. Spearman’s rank correlation was calculated for total counts per sample after all filtering steps (in folder qiime2/abundance_tables) with alpha diversity measures. No significant positive correlation was found between alpha diversity and sample counts:
Scatter plots with linear regression line (blue) with 95% confidence interval (gray shaded area).
Beta diversity measures the species community differences between samples. This step calculates beta diversity distances using various methods and performs pairwise comparisons of groups of samples. Additionally, principle coordinates analysis (PCoA) plots are produced that can be visualized with Emperor in your default browser without the need for installation. These calculations are based on a phylogenetic tree of all ASV sequences. Folder qiime2/diversity/beta_diversity contains the beta-diverity data:
Statistics on differences between specific metadata groups that can
be found in folder qiime2/diversity/beta_diversity/.
Each significance test result is in its separate folder following the
scheme {method}_distance_matrix-{treatment}:
qiime2/diversity/beta_diversity/bray_curtis_distance_matrix-Personnel/index.html
qiime2/diversity/beta_diversity/bray_curtis_distance_matrix-Primer/index.html
qiime2/diversity/beta_diversity/bray_curtis_distance_matrix-Site/index.html
qiime2/diversity/beta_diversity/jaccard_distance_matrix-Personnel/index.html
qiime2/diversity/beta_diversity/jaccard_distance_matrix-Primer/index.html
qiime2/diversity/beta_diversity/jaccard_distance_matrix-Site/index.html
qiime2/diversity/beta_diversity/unweighted_unifrac_distance_matrix-Personnel/index.html
qiime2/diversity/beta_diversity/unweighted_unifrac_distance_matrix-Primer/index.html
qiime2/diversity/beta_diversity/unweighted_unifrac_distance_matrix-Site/index.html
qiime2/diversity/beta_diversity/weighted_unifrac_distance_matrix-Personnel/index.html
qiime2/diversity/beta_diversity/weighted_unifrac_distance_matrix-Primer/index.html
qiime2/diversity/beta_diversity/weighted_unifrac_distance_matrix-Site/index.html
Analysis of Composition of Microbiomes (ANCOM) is applied to identify features that are differentially abundant across sample groups. A key assumption made by ANCOM is that few taxa (less than about 25%) will be differentially abundant between groups otherwise the method will be inaccurate. Comparisons between groups of samples is performed for specific metadata that can be found in folder qiime2/ancom/.
Test results are in separate folders following the scheme
Category-{treatment}-{taxonomic level}:
Phyloseq is a popular R package to analyse and visualize microbiom data. The produced RDS files contain phyloseq objects and can be loaded directely into R and phyloseq. The objects contain an ASV abundance table and a taxonomy table. If available, metadata and phylogenetic tree will also be included in the phyloseq object. The files can be found in folder phyloseq.
Data was processed using nf-core/ampliseq version 2.8.0 (doi: 10.5281/zenodo.1493841) (Straub et al., 2020) of the nf-core collection of workflows (Ewels et al., 2020), utilising reproducible software environments from the Bioconda (Grüning et al., 2018) and Biocontainers (da Veiga Leprevost et al., 2017) projects.
Data quality was evaluated with FastQC (Andrews, 2010) and summarized with MultiQC (Ewels et al., 2016). Cutadapt (Marcel et al., 2011) trimmed primers and all untrimmed sequences were discarded. Sequences that did not contain primer sequences were considered artifacts. Less than 2.2% of the sequences were discarded per sample and a mean of 98.2% of the sequences per sample passed the filtering. Adapter and primer-free sequences were processed initally independently, but re-examined as one pool (pseudo-pooled) with DADA2 (Callahan et al., 2016) to eliminate PhiX contamination, trim reads (forward reads at 220 bp and reverse reads at 215 bp, reads shorter than this were discarded), discard reads with > 2 expected errors, correct errors, merge read pairs, and remove polymerase chain reaction (PCR) chimeras; ultimately, 1154 amplicon sequencing variants (ASVs) were obtained across all samples. Between 32.8% and 79.86% reads per sample (average 56.3%) were retained. The ASV count table contained in total 280054 counts, at least 23704 and at most 41956 per sample (average 35007).
61 ASVs with length lower than 300 or above 340 bp were removed with less than 2.01% counts per sample (1093 ASVs passed).
Taxonomic classification was performed by DADA2 with a user provided database.
ASV sequences, abundance and DADA2 taxonomic assignments were loaded into QIIME2 (Bolyen et al., 2019).
Within QIIME2, the final microbial community data was visualized in a barplot, evaluated for sufficient sequencing depth with alpha rarefaction curves, investigated for alpha (within-sample) and beta (between-sample) diversity after rarefaction to 23228 counts, used to find differential abundant taxa with ANCOM (Mandal et al., 2015).
WARNING This methods section is lacking software versions, these can be found in MultiQC’s report section Software Versions or in folder pipeline_info file
software_versions.yml.
Taxonomic classification by DADA2:
MultiQC summarized computational methods in multiqc/multiqc_report.html. The proposed short methods description can be found in MultiQC’s Methods Description, versions of software collected at runtime in MultiQC’s Software Versions, and a summary of non-default parameter in MultiQC’s Workflow Summary.
Technical information to the pipeline run are collected in folder pipeline_info, including software versions
collected at runtime in file software_versions.yml (can be
viewed with a text editor), all parameter settings in file
params_{date}_{time}.json (can be viewed with a text
editor), execution report in file
execution_report_{date}_{time}.html, execution trace in
file execution_trace_{date}_{time}.txt, execution timeline
in file execution_timelime_{date}_{time}.html, and pipeline
direct acyclic graph (DAG) in file
pipeline_dag_{date}_{time}.html.
This report (file summary_report.html) is located in
folder summary_report of the original pipeline results
folder. In this file, all links to files and folders are relative,
therefore hyperlinks will only work when the report is at its original
place in the pipeline results folder. Plots specifically produced for
this report (if any) can be also found in folder summary_report.
A comprehensive read count report throughout the pipeline can be
found in the base results folder in file
overall_summary.tsv.
Please cite the pipeline publication and any software tools used by the pipeline (see citations) when you use any of the pipeline results in your study.